Epidemiologic Surveillance Data Processing


This guide aims to provide step-by-step instructions on how to conduct common data processing tasks for infectious disease surveillance using the open-source programming language R and the integrated development environment (IDE) for R, RStudio.


1 The Basics

1.1 Who is this guide for?

This guide is primarily targeted towards field epidemiologists who…

  • Do not have access to proprietary statistical software such as SAS and Stata
  • Do not have a solid background in any statistical programming language
  • Work collaboratively as part of routine infectious disease surveillance

1.2 Getting started with R

1.3 Tidy data principles

Often, the data we receive as part of epidemiologic surveillance is not ready for any type of statistical analysis. This raw data MUST be processed into tidy data. At the basic level:

  • Each row = one observation representing the unit of analysis
  • Each column = one variable
  • Each cell = standard data format, usually defined by a coding manual
  • If there are going to be multiple tables, there should be an identifying (ID) variable linking tables together.

Tidy data must not have:

  • Any blanks, unless these are true missing data
  • Too many special characters, unless absolutely necessary
  • Merged cells ANYWHERE - merged cells may be good visually but not for analysis
  • Colors to identify variables - these must be defined as a new column (variable), as colors cannot be read into analysis

For example, say we have five COVID-19 confirmed cases, and our raw data looks something like this on a spreadsheet:

Raw data from surveillance example

This raw data file:

  • Does not have a standard format for the cells - the dates are all encoded inconsistently
  • Has merged cells horizontally and vertically
  • “Flattens” the tests together with the cases
  • Has colored cells but no explanation - in this case, the yellow ones were the latest reported cases (in this hypothetical case, it is Oct 2) and the rest of the rows have no indication of when they were reported

We should split the data into two tables: one where each row is a case, another where each row is a test. The two tables are linked by a common, static ID. A tidy data version of the file above could look something like this instead:

The first table (each row = confirmed case)

ID DateOnset Municipality Community DateReport
1 2020-09-27 Funky Town Highland Z 2020-10-01
2 2020-09-26 Funky Town Highland Y 2020-10-01
3 2020-09-28 Providence People Village 2020-10-02
4 2020-09-25 Border Town Crescent Hill 2020-09-30
5 2020-09-30 New Horizons Block A1 2020-10-02

The second table (each row = test)

ID DateTest Result
1 2020-09-30 Positive
2 2020-09-30 Positive
2 2020-10-02 Positive
3 2020-10-01 Positive
4 2020-09-29 Positive
4 2020-10-03 Negative
5 2020-10-01 Positive
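To see how the common ID links the two tables in practice, here is a minimal sketch in R. The values are copied from the tables above (some columns are dropped for brevity), and merge(), which this guide introduces later, attaches the case-level information to each test:

```r
# The two tidy tables above as data frames, linked by the common ID
cases <- data.frame(
  ID = 1:5,
  DateOnset = as.Date(c("2020-09-27", "2020-09-26", "2020-09-28",
                        "2020-09-25", "2020-09-30")),
  Municipality = c("Funky Town", "Funky Town", "Providence",
                   "Border Town", "New Horizons")
)
tests <- data.frame(
  ID = c(1, 2, 2, 3, 4, 4, 5),
  DateTest = as.Date(c("2020-09-30", "2020-09-30", "2020-10-02", "2020-10-01",
                       "2020-09-29", "2020-10-03", "2020-10-01")),
  Result = c("Positive", "Positive", "Positive", "Positive",
             "Positive", "Negative", "Positive")
)
# Each row of the result is a test, carrying its case's information
merge(cases, tests, by = "ID")
```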

Additionally, there should be some sort of coding manual. For example:

  • ID: Unique ID assigned to each confirmed case (when a case has been assigned two ID numbers, discard the latest ID number and move on - DO NOT shift ID numbers upward)
  • DateOnset: Date of symptom onset as reported by the patient (format: YYYY-MM-DD)
  • Municipality: Municipality indicated in the current address reported by the patient (Names according to official geographic listing of national statistical authority)
  • Community: Community indicated in the current address reported by the patient (Names according to official geographic listing of national statistical authority)
  • DateReport: Date when case was officially reported to the surveillance system (format: YYYY-MM-DD)
  • DateTest: Date when case was swabbed for confirmatory testing (format: YYYY-MM-DD)
  • Result: Result of test conducted (Positive, Negative, Equivocal, Invalid)

We may prepare the tidy data using a spreadsheet program like Microsoft Excel, which may be familiar to most people. But we do not recommend this, as we want our data processing to be reproducible. We want every modification of our data accounted for, and we want our rules on data processing to be so clear that anyone else looking at the data can follow along.

To learn more about tidy data, refer to the following reference by Hadley Wickham (Paper) (Video).

1.4 Overview of functions

In this guide, we will learn to use the following functions:

  1. Base R
  • colnames(): Gets or sets the column names of a data frame
  • head(): Shows the first six rows of a data frame
  • expand.grid(): Creates a data frame from all combinations of two vectors
  • install.packages(): Downloads and installs packages from the Comprehensive R Archive Network (CRAN)
  • library(): Loads packages into the current session
  • max(): Takes the maximum value of a vector; min() takes the minimum
  • merge(): Merges two data frames by common columns
  • nchar(): Calculates the number of characters in a string value
  • str(): Displays the structure of any object
  • rbind(): Combines rows of two data frames, must have the same columns; cbind() combines columns, must have the same rows
  • which.max(): Returns the index of the maximum value of a vector
  2. tidyverse (specifically, dplyr)
  • arrange(): Sorts data frame based on specified column
  • case_when(): Vectorizes if/else statements. For those familiar with SQL, this is similar to the CASE WHEN statement.
  • filter(): Filters rows based on certain conditions
  • group_by(): Groups rows based on the values of the specified variable
  • mutate(): Creates new columns
  • relocate(): Arranges columns in a data frame
  • rename(): Renames column names (syntax: newname = oldname)
  • select(): Selects columns in a data frame
  • summarize(): Summarizes data across specified columns, often used together with group_by()
  • We also learn two operators:
    • The pipe %>% operator, which “pipes” the output of one function into the next to make code more readable
    • The %in% operator, which makes filtering values easier
  3. readxl
  • read_xlsx(): Imports Microsoft Excel spreadsheet files in .xlsx format
  4. writexl
  • write_xlsx(): Exports a data frame in Microsoft Excel Spreadsheet .xlsx format
  5. tidystringdist
  • tidy_stringdist(): Generates similarity scores between two string values
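As a quick preview of several of these functions and both operators, here is a minimal sketch using a made-up mini-linelist (the tibble toy and its values are hypothetical):

```r
library(dplyr)

# Hypothetical mini-linelist to preview the verbs and operators listed above
toy <- tibble(
  CaseID       = c(101, 102, 103, 104),
  Municipality = c("Mexe", "Chorgains", "Mexe", "San Wadhor"),
  Result       = c("Positive", "Negative", "Positive", "Positive")
)

# %>% pipes the result on its left into the function on its right;
# %in% checks membership, making it easy to filter on several values at once
toy %>%
  filter(Result == "Positive", Municipality %in% c("Mexe", "San Wadhor")) %>%
  group_by(Municipality) %>%
  summarize(Cases = n())
```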

2 Description of the Dummy Data

We assume that we are working on surveillance of a new infectious disease. There is one reference laboratory that conducts confirmatory testing for this infectious disease, and we define confirmed cases as those who test positive on the confirmatory test. We only test suspect cases, defined as those with symptoms relevant to this infectious disease, which means we expect that everyone who has been tested has a date of symptom onset, although in practice we expect this field to be blank more often than not.

We test suspect cases regularly, sometimes twice within a 24-hour window. We are interested in both the first positive test result and the first negative test result after that, as we define clinical recovery as the day that a case has received a negative test result. For simplicity, we deal with only positive or negative test results, although in practice we expect test results to be inconclusive/equivocal or samples collected to be invalid for testing.
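The "first positive, then first negative after that" logic can be sketched in R with dplyr. This is a hedged illustration with hypothetical test data, not the actual processing we do later in the guide:

```r
library(dplyr)

# Hypothetical tests: for each case, find the first positive result
# and the first negative result AFTER that positive
tests <- tibble(
  CaseID   = c(1, 1, 1, 2, 2),
  DateTest = as.Date(c("2021-07-01", "2021-07-03", "2021-07-10",
                       "2021-07-02", "2021-07-05")),
  Result   = c("Negative", "Positive", "Negative", "Positive", "Negative")
)

recovery <- tests %>%
  group_by(CaseID) %>%
  summarize(
    DatePositiveFirst = min(DateTest[Result == "Positive"]),
    # only negatives collected after the first positive count as recovery
    DateNegativeFirst = min(DateTest[Result == "Negative" &
                                       DateTest > DatePositiveFirst])
  )
```

Note that case 1's negative test on 2021-07-01 is correctly ignored, because it was collected before the first positive.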

The reference laboratory has an information system, and every day, the reference laboratory provides us with a laboratory linelist containing test results from the past 24 hours. The reference laboratory exports the file as a Microsoft Excel spreadsheet (.xlsx). Every day, we process the laboratory linelist from the past 24 hours, together with all the cases since the beginning of the surveillance activity, into the case linelist. We repeat this process every day.

  • Each row in the laboratory linelist is a laboratory result, and a unique ID number of the laboratory result is automatically assigned by the reference laboratory information system using the variable LabSpecimenID.
    • Additionally, there is another ID number CaseID, that the information system automatically generates whenever there is a new name in the system. If there is an exact match of an existing name in the system, the laboratory information system carries over the same CaseID to the new laboratory result. However, if the person has been tested before, but the name is not an exact match, the laboratory information system creates a new CaseID as if both laboratory results were from distinct individuals.
    • This means that the CaseID variable does not uniquely identify the persons being tested, and we need to deduplicate these names manually later.
  • Each row in the case linelist is a confirmed case, and the unique CaseID number from the laboratory linelist is carried over to the case linelist. We used to process the case linelist in Microsoft Excel, so this is also a Microsoft Excel spreadsheet. Moving forward, we want to update this case linelist using R, but we still want the flexibility of being able to open this in Microsoft Excel, so we will export the processed case linelist in R to a Microsoft Excel Spreadsheet at the end.
    • We process this case linelist daily to add new cases from the latest laboratory linelist, and update information of existing cases.
    • As we manually deduplicate our data, we retain the lowest CaseID value.
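To preview the manual deduplication of names, here is a hedged sketch of fuzzy name matching, assuming tidystringdist's tidy_comb_all() and tidy_stringdist() interface (the names below are examples, and the first two are plausibly the same person):

```r
library(tidystringdist)

# Hypothetical names to compare
names_df <- data.frame(Name = c("Martin Romero", "Martin F Romero", "Eve Mcbride"))

# Pair every name with every other name, then score each pair;
# jw is the Jaro-Winkler distance (0 = identical strings)
pairs <- tidy_comb_all(names_df, Name)
tidy_stringdist(pairs, method = "jw")
```

Pairs with a low jw score are candidates for manual review; the workflow later in this guide keeps pairs with jw less than or equal to 0.3.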

The variables in the laboratory linelist are as follows:

  • LabSpecimenID: Unique ID of laboratory result
  • CaseID: Laboratory information system’s attempt to identify cases using exact match of names; manual deduplication required
  • Name: Name of the individual as encoded into the information system
  • Age: Calculated age based on the date of birth encoded into the information system. For simplicity, we do not show the date of birth variable, but it is ideal that ages are calculated from date of birth rather than just reporting the age at the time of testing
  • Sex: Male or female (we assume this country only accepts two legal sexes)
  • Municipality: Municipality where the individual resides, as encoded into the information system
  • DateSpecimenCollection: Date when specimen was collected
  • DateResultReleased: Date when the result was released
  • DateOnset: Reported date of onset of symptoms relevant to the infectious disease
  • Result: Positive or negative

When we conduct data processing, it is important that we set up clear adjudication or validation rules. Most of the time, adjudication happens outside of data processing. For example, we may instruct the reporting health facility to re-encode the laboratory result in the information system. We may also review other health records and replace values manually in the information system. If we need to make modifications during data processing, we must be clear about the adjudication rules. We work with a fairly simple set of rules to demonstrate this point.

The variables in the case linelist, as well as the adjudication rules are as follows:

  • CaseID: This is the processed CaseID variable from the laboratory information system.
    • We first conduct manual deduplication and tag groups of laboratory results that are considered under the same individual.
    • After deduplicating, we retain the lowest CaseID number, e.g. between 11708 and 12890, we retain 11708.
  • The variables Name, Age, Sex, and Municipality are carried over from the laboratory linelist.
    • If there are conflicts in the values on any of these variables, then we need to review other health records or recheck with the reporting health facility.
    • In this example, we work with discrepancies in the Name value. The name with the most complete information is retained. For example, if one value has the middle name and the other does not, then the one with the middle name is retained.
  • The dates of specimen collection and result released are set up differently from the laboratory linelist. Specifically, in this case, we are primarily concerned with the first positive results (DateSpecimenCollection_PositiveFirst and DateResultReleased_PositiveFirst) and first negative result after that (DateSpecimenCollection_NegativeFirst and DateResultReleased_NegativeFirst).
    • This means that not all laboratory results from the same person are going to be used in this linelist of cases. There will be some loss of information.
    • This means that we have to restructure the values from the DateSpecimenCollection, DateResultReleased, and Result from the laboratory linelist to fit the columns in the case linelist.
  • The date of symptom onset DateOnset is also carried over from the laboratory linelist, unless there are discrepancies with the values from a more recent laboratory result entry.
    • For this example, our rule is that we keep the earliest recorded date of symptom onset, e.g. if the original date of symptom onset recorded was July 1, 2021, but the latest laboratory result recorded it as July 3, 2021, then we keep July 1, 2021.
    • We also check whether the date of symptom onset is on or before the date of specimen collection, as our hypothetical testing policy states that only symptomatic cases are going to be tested. If the date of symptom onset is after the date of specimen collection, then we replace the date of symptom onset with the date of specimen collection.
  • DateReport: This is the date when the case was added to the case linelist. We want to keep track of which cases were reported on which day, especially when we need to report the number of new cases to be reported for the day.


3 Setting Up

3.1 Downloading the files

We may download all the files in the GitHub repository here. We click on the green Code button, then click Download ZIP. While the files are usually found in our Downloads folder, we may unzip them to a new folder anywhere else on our computer (say, our Documents folder) and give it a name. This new folder serves as our working directory, so remember where it is.

Notice how the directory is organized into the following folders:

  • labs: This is the folder where we store all the daily laboratory linelists. For this guide, we will work with fakelablinelist_2021-07-16.xlsx
  • cases: This is the folder where we store all the daily case linelists. For this guide, we will work with the case linelist fakecaselinelist_2021-07-15.xlsx and fakelablinelist_2021-07-16.xlsx. By the end of this guide, we will have made fakecaselinelist_2021-07-16.xlsx.
  • code: This is the folder where we store our script files. Since we are processing laboratory results and cases daily with manual processing, it is recommended that we save a new script file every time we do our daily case processing. The R script file that contains all the code in this guide is dataprocessing_2021-07-16.R

All of the other files are either not relevant to this guide or are output files generated as part of the analysis.

3.2 Creating a new RStudio Project

  1. Open RStudio. On the menu bar, select File > New Project...
  2. The New Project Wizard dialog box opens. Select Existing Directory.
  3. Under Project working directory, select Browse... and locate the folder of the working directory.
  4. Select Create Project. At this point, we have created a new project file (.Rproj) as well as set the working directory.
  5. Create a new R script file using the keyboard shortcut Ctrl-Shift-N, by going to File > New File > R Script, or by clicking the New File button on the topmost left corner right below the menu bar and then clicking R Script. The left side of the environment should split into two, with the script file found on the upper left side. The script file is similar to a do-file for those who are familiar with Stata. Ideally, we should save all our code in a script file in case we need to redo or repeat analyses, so that all we have to do is run the script rather than code everything from scratch again.
  6. Save the script file every now and then. Give it a name. In the repository, this is named DataProcessing.R. Open that script file if we just want to run the code.

Alternatively, if we are familiar with setting up Git on RStudio, we may also set up the RStudio project by cloning the repository instead of downloading the ZIP file.

3.3 Loading packages

We use the following packages. We make sure the packages are installed beforehand using the install.packages() function.

Comments start with the hash key # and help make the code more understandable. We may choose to remove them and nothing will happen to the code, but it will be harder for us to remember what we actually wrote.

####### LOADING PACKAGES ####### 
# If using for the first time, use the following install.packages commands in comments
# Internet connection required, only need to do this once ever
# install.packages("tidyverse")
# install.packages("readxl")
# install.packages("writexl")
# install.packages("tidystringdist")
library(tidyverse)        # General data analysis package
library(readxl)           # For reading Excel files
library(writexl)          # For exporting to Excel
library(tidystringdist)   # For fuzzy matching

3.4 Importing dummy data

We now import both linelists. We store the laboratory linelist in a tibble called lab_today, and the case linelist in a tibble called case_yday. Tibbles are data frames that make data manipulation using the tidyverse package a little easier. We will import using readxl's read_xlsx() function. Since the files are not in the same subfolder as the code, we need to instruct R to go up one level, then locate the subfolder, then locate the file. Therefore, the path starts with "../" to indicate the relative position of the subfolder.

The names of the imported files may change every day. For example, our file names may include dates so that we can remember what day the file was processed. If that is the case, we have to clearly specify which parts of our code need to be updated every day and which do not. We will get creative with the hash key # to make this happen. In our case, we can enclose the code in double rows of hash keys ###... so that we can easily see which parts of the code we need to update daily. We do this so that we do not need to review the entire code every day, saving some time in our workflow.

###############################################################################
###############################################################################
### Anytime we enclose the code in a double row of hash keys, we are saying ###
### that this part of the code must be updated every time the code is run.  ###
###############################################################################
###############################################################################

###############################################################################
###############################################################################

####### IMPORTING THE DUMMY DATA ####### 

#### Import dummy laboratory linelist
lab_today <- read_xlsx("../labs/fakelablinelist_2021-07-16.xlsx")

#### Import dummy case linelist
case_yday <- read_xlsx("../cases/fakecaselinelist_2021-07-15.xlsx")

##############################################################################
##############################################################################

We take a peek at our data by displaying the first six rows using the head() function in base R.

####### INSPECTING OUR DATA ####### 

#### Show first six rows
head(lab_today)
## # A tibble: 6 x 10
##   LabSpecimenID CaseID Name        Age Sex   Municipality   DateSpecimenCollect~
##   <chr>          <dbl> <chr>     <dbl> <chr> <chr>          <dttm>              
## 1 AB-0458214     13597 Agata Lu~    35 F     Port Sipleach  2021-07-15 00:00:00 
## 2 AB-0458203     44979 Ailsa Hu~    70 F     Mexe           2021-07-15 00:00:00 
## 3 AB-0458219     44980 Amal Ford    40 M     Grand Wellwor~ 2021-07-15 00:00:00 
## 4 AB-0458206     13479 Ceara We~     2 F     Chorgains      2021-07-15 00:00:00 
## 5 AB-0458229     18793 Dustin P~    10 M     Grand Wellwor~ 2021-07-15 00:00:00 
## 6 AB-0458209     12579 Elijah H~    38 M     Mexe           2021-07-15 00:00:00 
## # ... with 3 more variables: DateResultReleased <dttm>, DateOnset <dttm>,
## #   Result <chr>
head(case_yday)
## # A tibble: 6 x 11
##   CaseID Name     Age Sex   Municipality DateSpecimenCollec~ DateResultReleased~
##    <dbl> <chr>  <dbl> <chr> <chr>        <dttm>              <dttm>             
## 1  13597 Agata~    35 F     Port Siplea~ 2021-07-02 00:00:00 2021-07-03 00:00:00
## 2  13479 Ceara~     2 F     Chorgains    2021-07-01 00:00:00 2021-07-02 00:00:00
## 3  18793 Dusti~    10 M     Grand Wellw~ 2021-07-08 00:00:00 2021-07-09 00:00:00
## 4  12579 Elija~    38 M     Mexe         2021-07-03 00:00:00 2021-07-04 00:00:00
## 5  13289 Ella-~    58 F     Grand Wellw~ 2021-07-02 00:00:00 2021-07-03 00:00:00
## 6  13547 Franc~     2 M     Eastmsallbu~ 2021-07-10 00:00:00 2021-07-11 00:00:00
## # ... with 4 more variables: DateSpecimenCollection_NegativeFirst <dttm>,
## #   DateResultReleased_NegativeFirst <dttm>, DateOnset <dttm>,
## #   DateReport <dttm>

We may also view the full tibbles by using the View() function, i.e. View(lab_today) and View(case_yday), on the Console (by default, on the bottom left of RStudio). A new tab opens on the Script editor (by default, on the upper left of RStudio).

The full laboratory linelist lab_today looks like this:

LabSpecimenID CaseID Name Age Sex Municipality DateSpecimenCollection DateResultReleased DateOnset Result
1 AB-0458214 13597 Agata Lucas 35 F Port Sipleach 2021-07-15 2021-07-16 NA Positive
2 AB-0458203 44979 Ailsa Hurst 70 F Mexe 2021-07-15 2021-07-16 2021-07-11 Negative
3 AB-0458219 44980 Amal Ford 40 M Grand Wellworth 2021-07-15 2021-07-15 2021-07-10 Positive
4 AB-0458206 13479 Ceara West 2 F Chorgains 2021-07-15 2021-07-16 2021-07-08 Negative
5 AB-0458229 18793 Dustin Payne 10 M Grand Wellworth 2021-07-15 2021-07-15 2021-07-09 Positive
6 AB-0458209 12579 Elijah Henson 38 M Mexe 2021-07-15 2021-07-16 2021-07-06 Negative
7 AB-0458217 13289 Ella-Mai Gregory 58 F Grand Wellworth 2021-07-15 2021-07-16 NA Positive
8 AB-0458218 44985 Emilee Horn 50 F Chorgains 2021-07-15 2021-07-16 2021-07-14 Negative
9 AB-0458207 44986 Martin Romero 18 M Port Sipleach 2021-07-15 2021-07-15 2021-07-16 Positive
10 AB-0458227 44987 Martin F Romero 18 M Port Sipleach 2021-07-16 2021-07-16 2021-07-12 Positive
11 AB-0458204 44988 Eve Mcbride 58 F San Wadhor 2021-07-15 2021-07-15 2021-07-10 Positive
12 AB-0458221 44988 Eve Mcbride 58 F San Wadhor 2021-07-16 2021-07-16 2021-07-11 Positive
13 AB-0458222 44981 Fabien Escobar 55 M Chorgains 2021-07-15 2021-07-16 2021-07-09 Negative
14 AB-0458201 44990 Fern Christian Mcarthur 40 M Port Sipleach 2021-07-15 2021-07-15 2021-07-14 Positive
15 AB-0458215 44991 Fern Mcarthur 40 M Port Sipleach 2021-07-16 2021-07-16 2021-07-14 Positive
16 AB-0458220 44982 Franciszek Vickers 2 M Eastmsallbuck Creek 2021-07-15 2021-07-16 2021-07-10 Negative
17 AB-0458223 44983 Harmony Howe 74 F Mexe 2021-07-15 2021-07-16 NA Negative
18 AB-0458202 44994 Ishaaq Baker 50 M Eastmsallbuck Creek 2021-07-15 2021-07-16 NA Negative
19 AB-0458210 44992 Jessica Bauer 3 F Eastmsallbuck Creek 2021-07-15 2021-07-15 NA Positive
20 AB-0458230 44993 Jess Bauer 3 F Eastmsallbuck Creek 2021-07-15 2021-07-16 2021-07-10 Positive
21 AB-0458224 44984 Kanye Novak 65 M San Wadhor 2021-07-15 2021-07-16 2021-07-10 Negative
22 AB-0458216 44995 Kyran Roach 20 M San Wadhor 2021-07-15 2021-07-15 2021-07-13 Negative
23 AB-0458225 18400 Leonidas Hudson 14 M Eastmsallbuck Creek 2021-07-15 2021-07-15 2021-07-06 Negative
24 AB-0458213 44996 Maud Shields 65 F Chorgains 2021-07-15 2021-07-16 2021-07-15 Negative
25 AB-0458212 44997 Penelope Fields 16 F Mexe 2021-07-15 2021-07-16 NA Positive

The full case linelist case_yday looks like this:

CaseID Name Age Sex Municipality DateSpecimenCollection_PositiveFirst DateResultReleased_PositiveFirst DateSpecimenCollection_NegativeFirst DateResultReleased_NegativeFirst DateOnset DateReport
1 13597 Agata Lucas 35 F Port Sipleach 2021-07-02 2021-07-03 NA NA 2021-06-30 2021-07-04
2 13479 Ceara West 2 F Chorgains 2021-07-01 2021-07-02 NA NA 2021-07-08 2021-07-03
3 18793 Dustin Payne 10 M Grand Wellworth 2021-07-08 2021-07-09 NA NA 2021-07-09 2021-07-10
4 12579 Elijah Henson 38 M Mexe 2021-07-03 2021-07-04 2021-07-10 2021-07-11 2021-07-02 2021-07-05
5 13289 Ella-Mai Gregory 58 F Grand Wellworth 2021-07-02 2021-07-03 NA NA NA 2021-07-04
6 13547 Francissek Vickers 2 M Eastmsallbuck Creek 2021-07-10 2021-07-11 NA NA 2021-07-10 2021-07-12
7 18400 Leonidas Hudson 14 M Eastmsallbuck Creek 2021-07-07 2021-07-08 NA NA 2021-07-06 2021-07-09
8 13566 Penelope F. Fields 45 F San Wadhor 2021-07-02 2021-07-02 NA NA 2021-07-01 2021-07-03
9 13788 Eve M. Mcbride 58 F San Wadhor 2021-07-02 2021-07-02 NA NA 2021-06-30 2021-07-03

3.5 Reading the dates correctly

If we look at the data structure of both tibbles lab_today and case_yday using the base R function str(), we notice the dates are read as calendar date-and-time objects (POSIXct).

#### Display the data structure
str(lab_today)
## tibble [25 x 10] (S3: tbl_df/tbl/data.frame)
##  $ LabSpecimenID         : chr [1:25] "AB-0458214" "AB-0458203" "AB-0458219" "AB-0458206" ...
##  $ CaseID                : num [1:25] 13597 44979 44980 13479 18793 ...
##  $ Name                  : chr [1:25] "Agata Lucas" "Ailsa Hurst" "Amal Ford" "Ceara West" ...
##  $ Age                   : num [1:25] 35 70 40 2 10 38 58 50 18 18 ...
##  $ Sex                   : chr [1:25] "F" "F" "M" "F" ...
##  $ Municipality          : chr [1:25] "Port Sipleach" "Mexe" "Grand Wellworth" "Chorgains" ...
##  $ DateSpecimenCollection: POSIXct[1:25], format: "2021-07-15" "2021-07-15" ...
##  $ DateResultReleased    : POSIXct[1:25], format: "2021-07-16" "2021-07-16" ...
##  $ DateOnset             : POSIXct[1:25], format: NA "2021-07-11" ...
##  $ Result                : chr [1:25] "Positive" "Negative" "Positive" "Negative" ...
str(case_yday)
## tibble [9 x 11] (S3: tbl_df/tbl/data.frame)
##  $ CaseID                              : num [1:9] 13597 13479 18793 12579 13289 ...
##  $ Name                                : chr [1:9] "Agata Lucas" "Ceara West" "Dustin Payne" "Elijah Henson" ...
##  $ Age                                 : num [1:9] 35 2 10 38 58 2 14 45 58
##  $ Sex                                 : chr [1:9] "F" "F" "M" "M" ...
##  $ Municipality                        : chr [1:9] "Port Sipleach" "Chorgains" "Grand Wellworth" "Mexe" ...
##  $ DateSpecimenCollection_PositiveFirst: POSIXct[1:9], format: "2021-07-02" "2021-07-01" ...
##  $ DateResultReleased_PositiveFirst    : POSIXct[1:9], format: "2021-07-03" "2021-07-02" ...
##  $ DateSpecimenCollection_NegativeFirst: POSIXct[1:9], format: NA NA ...
##  $ DateResultReleased_NegativeFirst    : POSIXct[1:9], format: NA NA ...
##  $ DateOnset                           : POSIXct[1:9], format: "2021-06-30" "2021-07-08" ...
##  $ DateReport                          : POSIXct[1:9], format: "2021-07-04" "2021-07-03" ...

By default, Microsoft Excel reads dates as mm/dd/yyyy (for countries using the mm/dd/yyyy format), but this is not a good date format. We should always use yyyy-mm-dd, as it is always clear which part is the month and which is the day. This date format is also the default in R, so when importing Excel spreadsheets, R tries to read mm/dd/yyyy values as dates. In the event that there are variables that are not read as dates, we may use the as.POSIXct() function in base R. For syntax on the date formats, check this link. For example:

# Sample vector of dates
sampledate <- c("07/01/2021", "07/02/2021", "07/03/2021")
str(sampledate)       # Read as string
##  chr [1:3] "07/01/2021" "07/02/2021" "07/03/2021"
# Read as calendar date and time
sampledate <- as.POSIXct(sampledate, format = "%m/%d/%Y")
str(sampledate)       # Read as calendar date and time
##  POSIXct[1:3], format: "2021-07-01" "2021-07-02" "2021-07-03"
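Since none of our date variables carry a meaningful time component, we may also prefer plain Date objects; a brief sketch using base R's as.Date():

```r
# The same vector read as a plain Date, if we do not need a time component
sampledate <- c("07/01/2021", "07/02/2021", "07/03/2021")
sampledate <- as.Date(sampledate, format = "%m/%d/%Y")
str(sampledate)       # Read as calendar date only
##  Date[1:3], format: "2021-07-01" "2021-07-02" "2021-07-03"
```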

3.6 Overview of the data processing workflow

Before we begin the actual data processing, we first outline the steps that we are taking.

  1. Process new cases from the laboratory linelist.
    1. Identify cases that have updates in the laboratory linelist.
      1. Create a data frame of ID variable pairs - LabSpecimenID from the laboratory linelist and CaseID from the case linelist. Store results in the data frame dedup_df.
      2. Add names and other information from the tibble lab_today in aid of manual deduplication.
        1. Store laboratory linelist information in the tibble lab_today_info.
        2. Store case linelist information in the tibble case_yday_info.
        3. Merge the tibble lab_today_info into the data frame dedup_df. Store results in the data frame dedup_df_info.
        4. Merge the tibble case_yday_info into the data frame dedup_df_info.
      3. Generate similarity scores using Jaro-Winkler (jw) distance. Store results in the data frame dedup_df_jw.
        1. Keep results with jw less than or equal to 0.3. This is just one way to implement fuzzy matching.
      4. Manually inspect results in the data frame dedup_df_jw.
        1. Subset rows that are duplicates from case linelist. These are the rows that we check later on to update the data of existing cases. Store results in the data frame dedup_df_jw_manual.
    2. Identify new cases within the laboratory linelist.
      1. Filter positive results from laboratory linelist. Store in the tibble lab_today_pos.
        1. Filter out duplicates already identified from the data frame dedup_df_jw_manual as those cases are not new cases. Store the results in the tibble lab_today_pos_nodup
      2. Create a data frame of ID variable pairs - LabSpecimenID from tibble lab_today_pos_nodup. Store results in the data frame dedup_new_df.
      3. Add names and other information from lab_today_pos_nodup in aid of manual deduplication and adjudication.
        1. Store results in the tibble lab_today_pos_nodup_info.
        2. Merge the data frame dedup_new_df with the tibble lab_today_pos_nodup_info. Store results in the data frame dedup_new_df_info.
        3. Merge the data frame dedup_new_df_info with the tibble lab_today_pos_nodup_info. We do this twice because we paired the LabSpecimenID of the tibble lab_today_pos_nodup with itself.
      4. Generate similarity scores using Jaro-Winkler (jw) distance. Store results in the data frame dedup_new_df_jw.
        1. Keep results with jw less than or equal to 0.3.
      5. Manually inspect results in the data frame dedup_new_df_jw.
        1. Subset rows that are duplicates of one another. These are the rows that we check later on to create a new case linelist.
        2. Manually tag a new ID variable to identify duplicate groups. Name the column DuplicateID. Store results in the data frame dedup_new_df_jw_manual.
  2. Save new case data.
    1. Filter new cases from lab_today_pos_nodup with no duplication issues. Store results in the tibble lab_today_pos_nodup_nodup (we use nodup twice to indicate this has been deduplicated twice over).
    2. Transform the tibble lab_today_pos_nodup_nodup into the columns of the tibble case_yday so that we may append these new cases to the case linelist later on. Store results in the tibble case_today.
    3. The tibble case_today contains all the new cases processed from the laboratory linelist. So far, we have the new cases with no issues in deduplication. We will add the deduplicated new cases later on.
  3. Adjudicate on manually deduplicated data.
    1. Complete compilation of new cases
      1. Implement the adjudication rules on the tibble dedup_new_df_jw_manual and store the results in the tibble case_today_dedup.
      2. Add the rows of the tibble case_today_dedup into the tibble case_today. At this point, we have completed the new cases for addition to the case linelist.
    2. Update data of current cases.
      1. Retrieve the ID pairs from the tibble dedup_df_jw_manual. Store results in the tibble dedup_df_jw_IDpairs.
      2. Merge the rest of the information from lab_today with dedup_df_jw_IDpairs. Store results in the tibble dedup_df_jw_allinfo.
      3. Transform the tibble dedup_df_jw_allinfo into the columns of the tibble case_yday. Store results in the tibble case_yday_newinfo.
      4. Retrieve the rows from the tibble case_yday that need updating by filtering the CaseID values from the tibble case_yday_newinfo. Store results in the tibble case_yday_oldinfo.
      5. Combine the rows from the tibbles case_yday_newinfo and case_yday_oldinfo. Store in a data frame case_yday_recon. We want to compare the old data from case_yday_oldinfo with the new data from case_yday_newinfo.
      6. Apply the adjudication rules. Store results in the tibble case_yday_update. The tibble case_yday_update contains all the old cases updated with data from the laboratory linelist.
  4. Compile the latest case linelist.
    1. From the tibble case_yday, filter out the CaseID values that are in case_yday_update. These are the cases that have no new updates. Store results in the tibble case_yday_noupdate. The tibble case_yday_noupdate contains all the old cases with no new updates from the laboratory linelist.
    2. Combine the rows from the tibbles case_today, case_yday_update, and case_yday_noupdate. Store results in the tibble called case_latest. The tibble case_latest contains the latest case linelist.
    3. Check date inconsistencies, e.g. date of symptom onset should not be after date of specimen collection of first positive test according to our hypothetical testing policy. Replace date of symptom onset with the date of specimen collection of first positive test if that is the case.
    4. Export the case_latest linelist into Microsoft Excel spreadsheet format.
    5. Tomorrow, the process repeats again and today’s tibble case_latest becomes tomorrow’s tibble case_yday.

A visual representation of the relationships of all the data frames generated by this workflow is presented below:


4 Processing New Cases from the Laboratory Linelist

We now deduplicate new cases from the lab linelist. We need to conduct two deduplication activities:

  • First, we need to check if any of the results from the laboratory linelist (lab_today) have already been added as cases in the case linelist (case_yday), and use these results to update the data of these cases.
  • Second, we need to deduplicate new cases within the laboratory linelist. Some persons’ names may appear twice in the laboratory linelist even though they are new cases not yet found in the case linelist.

Because our hypothetical laboratory information system can only detect exact matches of names for the generation of the CaseID, we need to use fuzzy matching to manually deduplicate names.
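To see why exact matching is not enough, consider a toy comparison with names taken from the example data. Here we use base R’s adist() (a generalized edit distance, used only for illustration; the workflow below relies on tidystringdist instead):

```r
# Exact comparison fails on near-identical names
"Eve Mcbride" == "Eve M. Mcbride"        # FALSE -- treated as different persons

# An edit distance captures how *close* the names are instead:
# adist() counts the single-character edits needed to turn one into the other
adist("Eve Mcbride", "Eve M. Mcbride")   # 3 edits (inserting "M. ")
```

Fuzzy matching turns the binary same/different question into a score that we can rank and inspect manually.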

4.1 Identifying cases that have updates in the laboratory linelist

4.1.1 Creating a data frame of ID variable pairs

We create a tibble (dedup_df) that contains the column LabSpecimenID of the tibble lab_today paired with all CaseID values in the tibble case_yday. Recall that these ID variables uniquely identify each row. Therefore, the tibble dedup_df represents all pairs of IDs that are useful for deduplication. We use the base R function expand.grid() to generate a data frame of all ID pairs. Note that if we break a long function call across several lines of code, R still reads it as a single call; we just have to make sure that all the parentheses are closed appropriately.

#### Create a data frame of ID variable pairs
# By default, expand.grid treats the string values as factors, 
# so we have to instruct R not to do that
dedup_df <- expand.grid(lab_today$LabSpecimenID, case_yday$CaseID, 
                        stringsAsFactors = FALSE)
# By default, expand.grid() names the columns Var1 and Var2. 
# But that's not helpful at all. We can rename them using colnames()
colnames(dedup_df) <- c("lab_today_id", "case_yday_id")
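To build intuition, here is a toy run of expand.grid() with hypothetical IDs; every value of the first vector is paired with every value of the second:

```r
# Hypothetical IDs, for illustration only
toy <- expand.grid(c("L1", "L2"), c("C1", "C2", "C3"),
                   stringsAsFactors = FALSE)

nrow(toy)   # 2 x 3 = 6 rows, one per ID pair
toy         # columns default to Var1 and Var2 until renamed
```

With real linelists, the number of pairs grows quickly (rows of lab_today times rows of case_yday), which is why we filter the pairs down by similarity score later on.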

4.1.2 Add names and other information in aid of manual deduplication

We now add more information to the ID variables to aid us in decision making on whether or not the ID pairs are duplicates. In our laboratory linelist, Name, Age, Sex, and Municipality all help in making that decision. We also add the ID variables (LabSpecimenID and CaseID), DateOnset and Result.

We first retrieve the relevant columns from lab_today and case_yday using dplyr’s select() function. We then store the results in the tibbles lab_today_info and case_yday_info. To make our code readable, we use dplyr’s pipe operator (%>%) instead of nesting parentheses.

#### Select columns in aid of manual deduplication
# The pipe %>% operator makes code more readable. For example:
# select(dataframe, columns) can be expressed as dataframe %>% select(columns)
lab_today_info <- lab_today %>% 
  select(LabSpecimenID, Name, Age, Sex, Municipality, DateOnset, Result)
case_yday_info <- case_yday %>% 
  select(CaseID, Name, Age, Sex, Municipality, DateOnset)

Then, we merge the data from the tibble lab_today_info into the data frame dedup_df and store it in a new data frame called dedup_df_info using the base R function merge(). We also rearrange the columns as we merge, using dplyr’s relocate() function.

#### Merge dedup_df with info from lab_today_info
# When merging, there are two data frames, labeled as x and y
# by.x and by.y are the variable names with which to link the two data frames
# In x, the ID is lab_today_id, in y, the ID is LabSpecimenID
dedup_df_info <- merge(x = dedup_df, y = lab_today_info, 
                       by.x = "lab_today_id", by.y = "LabSpecimenID")

# We want to put the case_yday_id column at the end
dedup_df_info <- dedup_df_info %>% relocate(case_yday_id, .after = last_col())

At this point, the data frame dedup_df_info has the relevant columns from the tibble lab_today_info merged with the column lab_today_id from the data frame dedup_df, and we have placed the column case_yday_id at the end. Now, we execute similar steps to merge the data from case_yday_info.

#### Merge with info from case_yday_info
## NOTE that the x data frame is now dedup_df_info, NOT dedup_df
# When merging, there are two data frames, labeled as x and y
# by.x and by.y are the variable names with which to link the two data frames
# In x, the ID is case_yday_id, in y, the ID is CaseID
dedup_df_info <- merge(x = dedup_df_info, y = case_yday_info, 
                       by.x = "case_yday_id", by.y = "CaseID")


# The columns Name, Age, Sex, and Municipality appear in both x and y data frames
# merge() appends a suffix .x or .y to indicate which column came from which

# We want to put the case_yday_id right before the Name.y column
# Visually, this reminds us that lab_today_info columns carry the suffix .x
# and case_yday_info columns carry the suffix .y
dedup_df_info <- dedup_df_info %>% relocate(case_yday_id, .before = Name.y)
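As a side note, the .x/.y suffix behavior can be seen in a toy merge of two hypothetical data frames that share a column name:

```r
# Both data frames have a column called Name
df1 <- data.frame(id  = 1:2,      Name = c("A", "B"))
df2 <- data.frame(key = c(1, 2),  Name = c("A2", "B2"))

merged <- merge(x = df1, y = df2, by.x = "id", by.y = "key")
names(merged)   # "id" "Name.x" "Name.y"
```

The linking columns (by.x/by.y) collapse into one column, while every other shared name gets suffixed so no information is silently overwritten.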

4.1.3 Generating similarity scores

We now run tidystringdist’s tidy_stringdist() function to generate similarity scores for all our pairwise names. We can read more about the available methods in the documentation (type ?tidy_stringdist-metrics on the Console). If we do not specify a method, the command generates a score using all available methods (which can be time-consuming), so in this example we use just one - the Jaro-Winkler distance (jw), which provides similarity scores on a scale of 0 to 1, where 0 is an exact match (the lower the score, the likelier the pair is a duplicate). We store the results in the data frame dedup_df_jw and retain pairs with a score of 0.3 and below for further inspection using dplyr’s filter() function, then sort by jw using dplyr’s arrange() function. Feel free to experiment with which method and cutoff work best for your data.

#### Generate similarity scores
# Use Jaro-Winkler distance to generate similarity scores
# v1 and v2 are the column of names
dedup_df_jw <- tidy_stringdist(dedup_df_info, v1 = "Name.x", 
                               v2 = "Name.y", method = "jw")

# Filter jw <= 0.3
dedup_df_jw <- dedup_df_jw %>% filter(jw <= 0.3)

# Sort by jw
dedup_df_jw <- dedup_df_jw %>% arrange(jw)

4.1.4 Manually selecting duplicates

Now, with the data frame dedup_df_jw, we manually select the rows that are duplicates by inspecting the entire data frame. We type View(dedup_df_jw) on the Console to open it as a separate tab in the Script editor. Take note of the rows that we mark as duplicates.

lab_today_id Name.x Age.x Sex.x Municipality.x DateOnset.x Result case_yday_id Name.y Age.y Sex.y Municipality.y DateOnset.y jw
1 AB-0458209 Elijah Henson 38 M Mexe 2021-07-06 Negative 12579 Elijah Henson 38 M Mexe 2021-07-02 0.0000000
2 AB-0458217 Ella-Mai Gregory 58 F Grand Wellworth NA Positive 13289 Ella-Mai Gregory 58 F Grand Wellworth NA 0.0000000
3 AB-0458206 Ceara West 2 F Chorgains 2021-07-08 Negative 13479 Ceara West 2 F Chorgains 2021-07-08 0.0000000
4 AB-0458214 Agata Lucas 35 F Port Sipleach NA Positive 13597 Agata Lucas 35 F Port Sipleach 2021-06-30 0.0000000
5 AB-0458225 Leonidas Hudson 14 M Eastmsallbuck Creek 2021-07-06 Negative 18400 Leonidas Hudson 14 M Eastmsallbuck Creek 2021-07-06 0.0000000
6 AB-0458229 Dustin Payne 10 M Grand Wellworth 2021-07-09 Positive 18793 Dustin Payne 10 M Grand Wellworth 2021-07-09 0.0000000
7 AB-0458220 Franciszek Vickers 2 M Eastmsallbuck Creek 2021-07-10 Negative 13547 Francissek Vickers 2 M Eastmsallbuck Creek 2021-07-10 0.0370370
8 AB-0458212 Penelope Fields 16 F Mexe NA Positive 13566 Penelope F. Fields 45 F San Wadhor 2021-07-01 0.0555556
9 AB-0458204 Eve Mcbride 58 F San Wadhor 2021-07-10 Positive 13788 Eve M. Mcbride 58 F San Wadhor 2021-06-30 0.0714286
10 AB-0458221 Eve Mcbride 58 F San Wadhor 2021-07-11 Positive 13788 Eve M. Mcbride 58 F San Wadhor 2021-06-30 0.0714286

In this example, all but Row 8 are duplicates. Penelope Fields is a 16-year-old female from Mexe, while Penelope F. Fields is a 45-year-old female from San Wadhor. It is possible that the two are related (which would explain the similar names), or that the age (or date of birth) was encoded incorrectly. We would need to revalidate this information outside of this data processing workflow, but let us assume in this case that we have checked and they really are two different persons.

Now we save the rows we need in a new tibble called dedup_df_jw_manual using base R’s bracket subsetting. We place a vector of row numbers inside the brackets after the data frame name, followed by a comma and nothing else (to indicate that we are selecting all columns).

##############################################################################
##############################################################################

# Selecting duplicates - only Row 8 is not a duplicate
# View(dedup_df_jw) to view the tibble
dedup_df_jw_manual <- dedup_df_jw[c(1:7, 9:10),]

##############################################################################
##############################################################################

The tibble dedup_df_jw_manual now contains laboratory results of cases already existing in the case linelist, and we need to update their information later on with this tibble. The next step deduplicates names among new cases within the laboratory linelist.

4.2 Identifying new cases within the laboratory linelist

4.2.1 Filtering positive results

We first filter only positive results from the laboratory linelist (lab_today) because these contain all the new cases and current cases with subsequent positive results. We save the filtered data in a tibble called lab_today_pos.

#### Filter positive results only
lab_today_pos <- lab_today %>% filter(Result == "Positive")

Then, we filter out the duplicates that we have identified in dedup_df_jw_manual. We use dplyr’s filter() function with the %in% operator. The %in% operator works like this: X %in% Y returns, for each value of X, whether that value appears in Y. In our case, X is LabSpecimenID and Y is dedup_df_jw_manual$lab_today_id. Since we need the values that are NOT in Y, we enclose the entire expression in !(), which negates the filter condition. We save the filtered data in a tibble called lab_today_pos_nodup.

#### Filter out the duplicates identified from dedup_df_jw_manual
# !() means we want the opposite result of the filter expression
# X %in% Y means we want to filter X based on values in Y
# We want to filter by laboratory result ID
# X: LabSpecimenID in lab_today_pos, Y: dedup_df_jw_manual$lab_today_id
lab_today_pos_nodup <- lab_today_pos %>% 
  filter(!(LabSpecimenID %in% dedup_df_jw_manual$lab_today_id))
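As a standalone illustration (hypothetical values), this is what %in% and its negation return:

```r
x <- c("a", "b", "c", "d")
y <- c("b", "d", "e")

x %in% y      # FALSE  TRUE FALSE  TRUE -- which values of x appear in y
!(x %in% y)   #  TRUE FALSE  TRUE FALSE -- the rows filter() keeps above
```

filter() keeps the rows where the logical vector is TRUE, so negating %in% keeps exactly the IDs that were not flagged as duplicates.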

Taking a look at the resulting tibble (View(lab_today_pos_nodup)):

LabSpecimenID CaseID Name Age Sex Municipality DateSpecimenCollection DateResultReleased DateOnset Result
1 AB-0458219 44980 Amal Ford 40 M Grand Wellworth 2021-07-15 2021-07-15 2021-07-10 Positive
2 AB-0458207 44986 Martin Romero 18 M Port Sipleach 2021-07-15 2021-07-15 2021-07-16 Positive
3 AB-0458227 44987 Martin F Romero 18 M Port Sipleach 2021-07-16 2021-07-16 2021-07-12 Positive
4 AB-0458201 44990 Fern Christian Mcarthur 40 M Port Sipleach 2021-07-15 2021-07-15 2021-07-14 Positive
5 AB-0458215 44991 Fern Mcarthur 40 M Port Sipleach 2021-07-16 2021-07-16 2021-07-14 Positive
6 AB-0458210 44992 Jessica Bauer 3 F Eastmsallbuck Creek 2021-07-15 2021-07-15 NA Positive
7 AB-0458230 44993 Jess Bauer 3 F Eastmsallbuck Creek 2021-07-15 2021-07-16 2021-07-10 Positive
8 AB-0458212 44997 Penelope Fields 16 F Mexe 2021-07-15 2021-07-16 NA Positive

This is the tibble that contains the new cases, but they have not yet been deduplicated among themselves. For example, rows 2 and 3, 4 and 5, and 6 and 7 are potential duplicates. Notice that Penelope Fields is in this tibble because we determined beforehand that the Penelope F. Fields in the case linelist is not the same person.

4.2.2 Creating a data frame of ID variable pairs

We now create a tibble called dedup_new_df that pairs each LabSpecimenID in the tibble lab_today_pos_nodup with every other LabSpecimenID in the same tibble. We use the expand.grid() function again. To save some space, we immediately delete the pairs whose two IDs are equal using the filter() function.

#### Creating a data frame of ID variable pairs
# Create a tibble that pairs all LabSpecimenID with themselves
dedup_new_df <- expand.grid(lab_today_pos_nodup$LabSpecimenID, 
                            lab_today_pos_nodup$LabSpecimenID,
                            stringsAsFactors = FALSE)

# Name columns
colnames(dedup_new_df) <- c("LabSpecimenID1", "LabSpecimenID2")

# Filter out values equal to one another
dedup_new_df <- dedup_new_df %>% filter(!(LabSpecimenID1 == LabSpecimenID2))
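Pairing a vector with itself produces every ordered pair, so each pair appears twice (once in each direction) and every ID is also paired with itself. A toy run with three hypothetical IDs shows why we filter:

```r
ids <- c("A", "B", "C")
pairs <- expand.grid(ids, ids, stringsAsFactors = FALSE)
nrow(pairs)                                  # 9 = 3 x 3 combinations

pairs <- pairs[pairs$Var1 != pairs$Var2, ]   # drop self-pairs like A-A
nrow(pairs)                                  # 6 mirrored pairs remain
```

This also explains why, during manual inspection later on, every duplicate shows up as two mirrored rows (A-B and B-A).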

4.2.3 Add names and other information for deduplication and adjudication

We now add more information to the names to aid us in deciding whether or not the pairs are duplicates, and which among the duplicates to retain. We use Age, Sex, Municipality, DateOnset, DateSpecimenCollection, and DateResultReleased to make that decision. We also add the CaseID variable. We store the selected columns in the tibble lab_today_pos_nodup_info.

#### Select columns in aid of manual deduplication
lab_today_pos_nodup_info <- lab_today_pos_nodup %>% 
  select(LabSpecimenID, CaseID, Name, Age, Sex, Municipality, 
         DateOnset, DateSpecimenCollection, DateResultReleased)

Then we merge these back into dedup_new_df and save the results in a new data frame called dedup_new_df_info.

#### Merge twice
# Merge with LabSpecimen1
dedup_new_df_info <- merge(x = dedup_new_df, y = lab_today_pos_nodup_info,
                           by.x = "LabSpecimenID1", by.y = "LabSpecimenID")

# Merge with LabSpecimen2 - note that x is now dedup_new_df_info
dedup_new_df_info <- merge(x = dedup_new_df_info, y = lab_today_pos_nodup_info,
                           by.x = "LabSpecimenID2", by.y = "LabSpecimenID")

# Relocate LabSpecimen2 before CaseID.y
dedup_new_df_info <- dedup_new_df_info %>% 
  relocate(LabSpecimenID2, .before = CaseID.y)

4.2.4 Generate similarity scores

We now run tidystringdist’s tidy_stringdist() function to generate similarity scores for all our pairwise names. We use the jw method and retain scores of 0.3 and below for further inspection. We store the results in a new data frame called dedup_new_df_jw. We also sort the results by jw and Name.x (the first name column) in aid of manual inspection.

#### Generate similarity scores
# Use Jaro-Winkler distance to generate similarity scores
# v1 and v2 are the column of names
dedup_new_df_jw <- tidy_stringdist(dedup_new_df_info, v1 = "Name.x", 
                                   v2 = "Name.y", method = "jw")

# Filter jw <= 0.3
dedup_new_df_jw <- dedup_new_df_jw %>% filter(jw <= 0.3)

# Sort by jw and Name.x
dedup_new_df_jw <- dedup_new_df_jw %>% arrange(jw, Name.x)

4.2.5 Manually selecting duplicates

Now, with the data frame dedup_new_df_jw, we manually select the rows that are duplicates by inspecting the entire data frame. We type View(dedup_new_df_jw) on the Console to open it as a separate tab in the Script editor. Take note of the rows that we choose to retain.

LabSpecimenID1 CaseID.x Name.x Age.x Sex.x Municipality.x DateOnset.x DateSpecimenCollection.x DateResultReleased.x LabSpecimenID2 CaseID.y Name.y Age.y Sex.y Municipality.y DateOnset.y DateSpecimenCollection.y DateResultReleased.y jw
1 AB-0458227 44987 Martin F Romero 18 M Port Sipleach 2021-07-12 2021-07-16 2021-07-16 AB-0458207 44986 Martin Romero 18 M Port Sipleach 2021-07-16 2021-07-15 2021-07-15 0.0444444
2 AB-0458207 44986 Martin Romero 18 M Port Sipleach 2021-07-16 2021-07-15 2021-07-15 AB-0458227 44987 Martin F Romero 18 M Port Sipleach 2021-07-12 2021-07-16 2021-07-16 0.0444444
3 AB-0458230 44993 Jess Bauer 3 F Eastmsallbuck Creek 2021-07-10 2021-07-15 2021-07-16 AB-0458210 44992 Jessica Bauer 3 F Eastmsallbuck Creek NA 2021-07-15 2021-07-15 0.1269231
4 AB-0458210 44992 Jessica Bauer 3 F Eastmsallbuck Creek NA 2021-07-15 2021-07-15 AB-0458230 44993 Jess Bauer 3 F Eastmsallbuck Creek 2021-07-10 2021-07-15 2021-07-16 0.1269231
5 AB-0458201 44990 Fern Christian Mcarthur 40 M Port Sipleach 2021-07-14 2021-07-15 2021-07-15 AB-0458215 44991 Fern Mcarthur 40 M Port Sipleach 2021-07-14 2021-07-16 2021-07-16 0.2474916
6 AB-0458215 44991 Fern Mcarthur 40 M Port Sipleach 2021-07-14 2021-07-16 2021-07-16 AB-0458201 44990 Fern Christian Mcarthur 40 M Port Sipleach 2021-07-14 2021-07-15 2021-07-15 0.2474916

There are no rows to exclude, as all the rows are duplicates of one another: Rows 1 and 2, 3 and 4, and 5 and 6 form mirrored pairs. We save the results in a new data frame dedup_new_df_jw_manual. We add a new column called DuplicateID that manually tags rows as duplicates of one another.

##############################################################################
##############################################################################

# No rows to exclude, so save everything
dedup_new_df_jw_manual <- dedup_new_df_jw

# Add a new column that manually tags the duplicates
# Row 1 and 2 have DuplicateID 1
# Row 3 and 4 have DuplicateID 2
# Row 5 and 6 have DuplicateID 3
dedup_new_df_jw_manual$DuplicateID <- c(1, 1, 2, 2, 3, 3)

##############################################################################
##############################################################################


5 Saving New Case Data

5.1 Filtering new cases with no duplication issues

Now we filter out the LabSpecimenID values from the tibble lab_today_pos_nodup that appear in LabSpecimenID1 from the tibble dedup_new_df_jw_manual. We save the results in a new tibble lab_today_pos_nodup_nodup (nodup is used twice to indicate we have removed duplicates twice over).

#### Filtering new cases with no duplication issues
lab_today_pos_nodup_nodup <- lab_today_pos_nodup %>% 
  filter(!(LabSpecimenID %in% dedup_new_df_jw_manual$LabSpecimenID1))

We see we have two cases:

LabSpecimenID CaseID Name Age Sex Municipality DateSpecimenCollection DateResultReleased DateOnset Result
1 AB-0458219 44980 Amal Ford 40 M Grand Wellworth 2021-07-15 2021-07-15 2021-07-10 Positive
2 AB-0458212 44997 Penelope Fields 16 F Mexe 2021-07-15 2021-07-16 NA Positive

5.2 Preparing to add to case linelist

Now we transform the tibble lab_today_pos_nodup_nodup into the columns of the tibble case_yday so that we may append them later. We store the new tibble as case_today. Recall that:

  • We carry over the values of CaseID, Name, Age, Sex, Municipality, and DateOnset.
  • The values in DateSpecimenCollection and DateResultReleased are carried over to DateSpecimenCollection_PositiveFirst and DateResultReleased_PositiveFirst, respectively.
  • DateSpecimenCollection_NegativeFirst and DateResultReleased_NegativeFirst are blank as these are new cases.
  • DateReport will be today’s date, which in this case is July 16, 2021.

Because the columns of lab_today_pos_nodup_nodup map 1:1 to case_yday, we can first duplicate the tibble lab_today_pos_nodup_nodup as a new tibble called case_today, then rename, remove, or add blank columns in case_today until it resembles case_yday. We rename columns using dplyr’s rename() function (syntax: newname = oldname). We remove columns using dplyr’s select() function, enclosing the column names to delete in -c(). We add blank date columns by creating them, assigning them NA values, and declaring their class with as.POSIXct() for dates.

#### Prepare case_today linelist
# Duplicate lab_today_pos_nodup_nodup
case_today <- lab_today_pos_nodup_nodup

# Rename DateSpecimenCollection and DateResultReleased
case_today <- case_today %>% 
  rename(DateSpecimenCollection_PositiveFirst = DateSpecimenCollection,
         DateResultReleased_PositiveFirst = DateResultReleased)

# Remove the LabSpecimenID and Result columns
case_today <- case_today %>% select(-c(LabSpecimenID, Result))

# Add blank date columns
case_today$DateSpecimenCollection_NegativeFirst <- as.POSIXct(NA)
case_today$DateResultReleased_NegativeFirst <- as.POSIXct(NA)

###############################################################################
###############################################################################

# Specify date today
DateToday <- as.POSIXct("2021-07-16")

###############################################################################
###############################################################################

# Add DateReport column
case_today$DateReport <- DateToday

# Move date onset before DateReport
case_today <- case_today %>% relocate(DateOnset, .before = "DateReport")

The case_today linelist should look something like this:

CaseID Name Age Sex Municipality DateSpecimenCollection_PositiveFirst DateResultReleased_PositiveFirst DateSpecimenCollection_NegativeFirst DateResultReleased_NegativeFirst DateOnset DateReport
1 44980 Amal Ford 40 M Grand Wellworth 2021-07-15 2021-07-15 NA NA 2021-07-10 2021-07-16
2 44997 Penelope Fields 16 F Mexe 2021-07-15 2021-07-16 NA NA NA 2021-07-16


6 Adjudicating on Data

6.1 Compiling new cases

Now we decide which information to use from each column among the new cases in dedup_new_df_jw_manual. Recall the simple adjudication rules:

  • For CaseID, the lower number is kept.
  • For Name, the longer one is kept.
  • For DateOnset, the earlier one is kept.
  • For DateSpecimenCollection, the earlier one is kept for DateSpecimenCollection_PositiveFirst.
  • For DateResultReleased, the earlier one is kept for DateResultReleased_PositiveFirst.
  • DateSpecimenCollection_NegativeFirst and DateResultReleased_NegativeFirst are both blank.
  • DateReport is the date for the day.

Because we tagged the duplicates manually, we have to select the values we need manually as well. There is no workaround for this, and it might be tempting to do it in Microsoft Excel. DON’T. Resist all temptation. The code ends up long, but the advantage is that we have a clear paper trail of the decisions we have made.

We use dplyr’s group_by() function to tell R that we plan to adjudicate by DuplicateID. Then, we use dplyr’s summarize() function to define our columns to match case_today (so that we can append them later) and to implement the adjudication rules. We use the base R functions min() and max() where we need to identify minimum and maximum values. For DateOnset, we also have to exclude NA values before deciding which is the minimum. For Name, we use the base R function which.max() to find the position of the longest name, and the base R function nchar() to count its characters. We then save our result in the tibble case_today_dedup:

#### Apply adjudication rules
# Use all the .x-suffixed columns
# Adjudication rules
# * For `CaseID`, the lower number is kept.
# * For `Name`, the longer one is kept.
# * For `DateOnset`, the earlier one is kept.
# * For `DateSpecimenCollection`, the earlier one is kept for `DateSpecimenCollection_PositiveFirst`.
# * For `DateResultReleased`, the earlier one is kept for `DateResultReleased_PositiveFirst`.
# * `DateSpecimenCollection_NegativeFirst` and `DateResultReleased_NegativeFirst` are both blank.
# * `DateReport` is the date today.
case_today_dedup <- dedup_new_df_jw_manual %>% group_by(DuplicateID) %>% 
  summarize(CaseID = min(CaseID.x),
            Name = Name.x[which.max(nchar(Name.x))],
            Age = min(Age.x),    # Doesn't matter in our hypothetical example, same value anyway
            Sex = min(Sex.x),    # Doesn't matter in our hypothetical example, same value anyway
            Municipality = min(Municipality.x), # Doesn't matter in our hypothetical example, same value anyway
            DateSpecimenCollection_PositiveFirst = min(DateSpecimenCollection.x),
            DateResultReleased_PositiveFirst = min(DateResultReleased.x),
            DateSpecimenCollection_NegativeFirst = as.POSIXct(NA),
            DateResultReleased_NegativeFirst = as.POSIXct(NA),
            DateOnset = min(DateOnset.x[!is.na(DateOnset.x)]),
            DateReport = DateToday)

# Finally, remove the DuplicateID as we do not need it anymore
case_today_dedup <- case_today_dedup %>% select(-(DuplicateID))
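The Name adjudication rule above relies on the which.max(nchar()) idiom; a toy run on a pair of names from the example data:

```r
names_vec <- c("Fern Mcarthur", "Fern Christian Mcarthur")

nchar(names_vec)                          # 13 23 -- characters per name
which.max(nchar(names_vec))               # 2  -- position of the longest name
names_vec[which.max(nchar(names_vec))]    # "Fern Christian Mcarthur"
```

Indexing the vector by which.max(nchar()) therefore keeps the most complete version of the name within each duplicate group.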

Then, we append the cases from case_today_dedup into case_today and we finally have a clean linelist of cases today - deduplicated and adjudicated. We use the base R function rbind() to do this.

#### Bind the rows together of case_today and case_today_dedup
case_today <- rbind(case_today, case_today_dedup)

Our new case linelist case_today now looks like this:

CaseID Name Age Sex Municipality DateSpecimenCollection_PositiveFirst DateResultReleased_PositiveFirst DateSpecimenCollection_NegativeFirst DateResultReleased_NegativeFirst DateOnset DateReport
1 44980 Amal Ford 40 M Grand Wellworth 2021-07-15 2021-07-15 NA NA 2021-07-10 2021-07-16
2 44997 Penelope Fields 16 F Mexe 2021-07-15 2021-07-16 NA NA NA 2021-07-16
3 44986 Martin F Romero 18 M Port Sipleach 2021-07-15 2021-07-15 NA NA 2021-07-12 2021-07-16
4 44992 Jessica Bauer 3 F Eastmsallbuck Creek 2021-07-15 2021-07-15 NA NA 2021-07-10 2021-07-16
5 44990 Fern Christian Mcarthur 40 M Port Sipleach 2021-07-15 2021-07-15 NA NA 2021-07-14 2021-07-16

6.2 Updating current cases

6.2.1 Retrieving ID pairs

We first retrieve the LabSpecimenID and CaseID pairs from the data frame dedup_df_jw_manual. In that data frame, these correspond to the columns lab_today_id and case_yday_id. We store the ID pairs in a data frame called dedup_df_jw_IDpairs.

#### Retrieve ID pairs
dedup_df_jw_IDpairs <- dedup_df_jw_manual %>% select(lab_today_id, case_yday_id)

6.2.2 Merging with the rest of the information

Then, we retrieve all the columns we need from the tibble lab_today and merge them with the lab_today_id in the data frame dedup_df_jw_IDpairs. We store the results in a tibble called dedup_df_jw_allinfo.

#### Merge to retrieve all info
# all.x = TRUE means to retain all rows of x regardless if there is a match,
# all.y = FALSE means to drop rows of y if there is no match.
dedup_df_jw_allinfo <- merge(x = dedup_df_jw_IDpairs, y = lab_today,
                             by.x = "lab_today_id", by.y = "LabSpecimenID",
                             all.x = TRUE, all.y = FALSE)

6.2.3 Preparing to add to case linelist

Now we transform the data frame dedup_df_jw_allinfo into the columns of the tibble case_yday so that it is easier for us to update case information. We store the new tibble as case_yday_newinfo.

  • We carry over the values of CaseID, Name, Age, Sex, Municipality, and DateOnset
  • The values in DateSpecimenCollection and DateResultReleased are assigned to DateSpecimenCollection_PositiveFirst and DateResultReleased_PositiveFirst if the result is positive, and to DateSpecimenCollection_NegativeFirst and DateResultReleased_NegativeFirst if the result is negative. We use dplyr’s mutate() and case_when() functions for this. Note that the pipe operator (%>%) can be used to chain multiple instructions together.
  • We then remove the columns lab_today_id, CaseID, DateSpecimenCollection, DateResultReleased, and Result since we no longer need them after this. CaseID in this data frame pertains to the autogenerated CaseID from the laboratory linelist. We keep the column case_yday_id because it is the ID variable that allows us to link rows to the tibble case_yday; we then rename case_yday_id to CaseID.
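Before applying it to the linelist, here is a minimal sketch of how case_when() behaves (toy values, for illustration only): rows that match no condition fall through to NA.

```r
library(dplyr)

result <- c("Positive", "Negative", "Positive")
dates  <- as.POSIXct(c("2021-07-15", "2021-07-15", "2021-07-16"))

# The 2nd element matches no condition, so it becomes NA
case_when(result == "Positive" ~ dates)
```

This fall-through behavior is what lets us split one date column into separate positive-first and negative-first columns.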
#### Prepare case_yday_newinfo linelist
# Duplicate dedup_df_jw_allinfo
case_yday_newinfo <- dedup_df_jw_allinfo

# Keep the Name, Age, Sex, Municipality, and DateOnset columns as is
# Note that case_yday_id contains the CaseID as it appears in the case linelist
# CaseID here refers to the autogenerated CaseID from the laboratory linelist

# Using mutate to assign columns of dates of specimen collection and dates of result released
# TRUE ~ acts as the catch-all: rows with the other result become NA
case_yday_newinfo <- case_yday_newinfo %>%
  mutate(DateSpecimenCollection_PositiveFirst = 
           case_when(Result == "Positive" ~ DateSpecimenCollection,
                     TRUE ~ as.POSIXct(NA))) %>%
  mutate(DateResultReleased_PositiveFirst =
           case_when(Result == "Positive" ~ DateResultReleased,
                     TRUE ~ as.POSIXct(NA))) %>%
  mutate(DateSpecimenCollection_NegativeFirst = 
           case_when(Result == "Negative" ~ DateSpecimenCollection,
                     TRUE ~ as.POSIXct(NA))) %>%
  mutate(DateResultReleased_NegativeFirst = 
           case_when(Result == "Negative" ~ DateResultReleased,
                     TRUE ~ as.POSIXct(NA)))

# Remove extraneous columns
case_yday_newinfo <- case_yday_newinfo %>% 
  select(-c(CaseID, lab_today_id, DateSpecimenCollection, DateResultReleased, Result))

# Rename case_yday_id to CaseID
case_yday_newinfo <- case_yday_newinfo %>% rename(CaseID = case_yday_id)

# Add DateReport
case_yday_newinfo$DateReport <- DateToday

# Relocate DateOnset before DateReport
case_yday_newinfo <- case_yday_newinfo %>% relocate(DateOnset, .before = "DateReport")

The tibble should now look like this:

CaseID Name Age Sex Municipality DateSpecimenCollection_PositiveFirst DateResultReleased_PositiveFirst DateSpecimenCollection_NegativeFirst DateResultReleased_NegativeFirst DateOnset DateReport
1 13788 Eve Mcbride 58 F San Wadhor 2021-07-15 2021-07-15 NA NA 2021-07-10 2021-07-16
2 13479 Ceara West 2 F Chorgains NA NA 2021-07-15 2021-07-16 2021-07-08 2021-07-16
3 12579 Elijah Henson 38 M Mexe NA NA 2021-07-15 2021-07-16 2021-07-06 2021-07-16
4 13597 Agata Lucas 35 F Port Sipleach 2021-07-15 2021-07-16 NA NA NA 2021-07-16
5 13289 Ella-Mai Gregory 58 F Grand Wellworth 2021-07-15 2021-07-16 NA NA NA 2021-07-16
6 13547 Franciszek Vickers 2 M Eastmsallbuck Creek NA NA 2021-07-15 2021-07-16 2021-07-10 2021-07-16
7 13788 Eve Mcbride 58 F San Wadhor 2021-07-16 2021-07-16 NA NA 2021-07-11 2021-07-16
8 18400 Leonidas Hudson 14 M Eastmsallbuck Creek NA NA 2021-07-15 2021-07-15 2021-07-06 2021-07-16
9 18793 Dustin Payne 10 M Grand Wellworth 2021-07-15 2021-07-15 NA NA 2021-07-09 2021-07-16
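As an aside on case_when(): the left-hand side of each formula must be a logical condition, and a final TRUE ~ arm serves as the catch-all for rows no earlier arm matched (rows matched by no arm default to NA). A minimal sketch with made-up values, assuming dplyr is loaded:

```r
library(dplyr)

# Toy data (hypothetical), not part of the surveillance linelists
toy <- tibble(Result = c("Positive", "Negative", "Equivocal"))

toy %>%
  mutate(FlagPositive = case_when(Result == "Positive" ~ "Yes",
                                  Result == "Negative" ~ "No",
                                  TRUE ~ NA_character_))  # catch-all arm
```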

6.2.4 Add the information from the case linelist for comparison

Now we retrieve, from the current case linelist case_yday, the rows for the cases that received new information, and save them in a tibble called case_yday_oldinfo. We then stack the old and new rows into a tibble called case_yday_recon. On this tibble, we apply the adjudication rules and save the result in a new tibble called case_yday_update.

  • For Name, the longer one is kept.
  • For all date columns, the earliest one is kept.
#### Prepare case_yday_oldinfo linelist and adjudicate on data
# Retrieve the old case information
case_yday_oldinfo <- case_yday %>% filter(CaseID %in% case_yday_newinfo$CaseID)

# Combine rows and arrange by CaseID
case_yday_recon <- rbind(case_yday_oldinfo, case_yday_newinfo)
case_yday_recon <- case_yday_recon %>% arrange(CaseID)

# Adjudication rules
# * For `Name`, the longer one is kept.
# * For all date columns, the earliest one is kept
case_yday_update <- case_yday_recon %>% group_by(CaseID) %>% 
  summarize(Name = Name[which.max(nchar(Name))],
            Age = min(Age),    # Doesn't matter in our hypothetical example, same value anyway
            Sex = min(Sex),    # Doesn't matter in our hypothetical example, same value anyway
            Municipality = min(Municipality), # Doesn't matter in our hypothetical example, same value anyway
            DateSpecimenCollection_PositiveFirst =
              min(DateSpecimenCollection_PositiveFirst[!is.na(DateSpecimenCollection_PositiveFirst)]),
            DateResultReleased_PositiveFirst = 
              min(DateResultReleased_PositiveFirst[!is.na(DateResultReleased_PositiveFirst)]),
            DateSpecimenCollection_NegativeFirst =
              min(DateSpecimenCollection_NegativeFirst[!is.na(DateSpecimenCollection_NegativeFirst)]),
            DateResultReleased_NegativeFirst = 
              min(DateResultReleased_NegativeFirst[!is.na(DateResultReleased_NegativeFirst)]),
            DateOnset = min(DateOnset[!is.na(DateOnset)]),
            DateReport = min(DateReport[!is.na(DateReport)]))

The tibble case_yday_update should now look like this:

CaseID Name Age Sex Municipality DateSpecimenCollection_PositiveFirst DateResultReleased_PositiveFirst DateSpecimenCollection_NegativeFirst DateResultReleased_NegativeFirst DateOnset DateReport
1 12579 Elijah Henson 38 M Mexe 2021-07-03 2021-07-04 2021-07-10 2021-07-11 2021-07-02 2021-07-05
2 13289 Ella-Mai Gregory 58 F Grand Wellworth 2021-07-02 2021-07-03 NA NA NA 2021-07-04
3 13479 Ceara West 2 F Chorgains 2021-07-01 2021-07-02 2021-07-15 2021-07-16 2021-07-08 2021-07-03
4 13547 Francissek Vickers 2 M Eastmsallbuck Creek 2021-07-10 2021-07-11 2021-07-15 2021-07-16 2021-07-10 2021-07-12
5 13597 Agata Lucas 35 F Port Sipleach 2021-07-02 2021-07-03 NA NA 2021-06-30 2021-07-04
6 13788 Eve M. Mcbride 58 F San Wadhor 2021-07-02 2021-07-02 NA NA 2021-06-30 2021-07-03
7 18400 Leonidas Hudson 14 M Eastmsallbuck Creek 2021-07-07 2021-07-08 2021-07-15 2021-07-15 2021-07-06 2021-07-09
8 18793 Dustin Payne 10 M Grand Wellworth 2021-07-08 2021-07-09 NA NA 2021-07-09 2021-07-10
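One caveat with the adjudication code: when every value of a date column is missing within a CaseID group, min() over the empty subset raises a warning ("no non-missing arguments to min"). If those warnings become a nuisance, a small helper (hypothetical, not part of the guide's code) can replace the min(x[!is.na(x)]) pattern inside summarize():

```r
# Hypothetical helper: earliest non-missing value, or a single NA of the
# same class when the group has no non-missing values at all
min_or_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(x[NA_integer_])  # NA, preserving the class of x
  min(x)
}
```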


7 Compiling the Latest Case Linelist

7.1 Putting the processed tibbles altogether

At this point, we have a linelist of processed new cases called case_today, and a linelist of current cases with updated information called case_yday_update. To compile the full latest case linelist, we need one more linelist: current cases with no updates. We keep the rows of the tibble case_yday whose CaseID does not appear in case_yday_update, and save this linelist as a new tibble called case_yday_noupdate.

#### Retrieve cases with no update
case_yday_noupdate <- case_yday %>% filter(!(CaseID %in% case_yday_update$CaseID))
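As an optional sanity check (a suggested addition, not part of the original workflow), the updated and no-update tibbles should partition case_yday cleanly: no CaseID should appear in both, and their row counts should sum to the original:

```r
# Optional sanity check: case_yday_update and case_yday_noupdate should
# split case_yday cleanly, with no CaseID appearing in both
stopifnot(length(intersect(case_yday_update$CaseID,
                           case_yday_noupdate$CaseID)) == 0)
stopifnot(nrow(case_yday_update) + nrow(case_yday_noupdate) == nrow(case_yday))
```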

Finally, we can combine the three tibbles case_today, case_yday_update, and case_yday_noupdate into the latest case linelist. We save this in a tibble called case_latest.

#### Combine the three tibbles to the full latest case linelist, and arrange by CaseID
case_latest <- rbind(case_today, case_yday_update, case_yday_noupdate)
case_latest <- case_latest %>% arrange(CaseID)
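Before moving on, it can be worth confirming that no CaseID appears twice in the combined linelist; this check is a suggested addition, not part of the original script:

```r
# Each case should appear exactly once in the latest linelist
stopifnot(!any(duplicated(case_latest$CaseID)))
```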

The latest case linelist now looks like this:

CaseID Name Age Sex Municipality DateSpecimenCollection_PositiveFirst DateResultReleased_PositiveFirst DateSpecimenCollection_NegativeFirst DateResultReleased_NegativeFirst DateOnset DateReport
1 12579 Elijah Henson 38 M Mexe 2021-07-03 2021-07-04 2021-07-10 08:00:00 2021-07-11 08:00:00 2021-07-02 2021-07-05 08:00:00
2 13289 Ella-Mai Gregory 58 F Grand Wellworth 2021-07-02 2021-07-03 NA NA NA 2021-07-04 08:00:00
3 13479 Ceara West 2 F Chorgains 2021-07-01 2021-07-02 2021-07-15 08:00:00 2021-07-16 08:00:00 2021-07-08 2021-07-03 08:00:00
4 13547 Francissek Vickers 2 M Eastmsallbuck Creek 2021-07-10 2021-07-11 2021-07-15 08:00:00 2021-07-16 08:00:00 2021-07-10 2021-07-12 08:00:00
5 13566 Penelope F. Fields 45 F San Wadhor 2021-07-02 2021-07-02 NA NA 2021-07-01 2021-07-03 08:00:00
6 13597 Agata Lucas 35 F Port Sipleach 2021-07-02 2021-07-03 NA NA 2021-06-30 2021-07-04 08:00:00
7 13788 Eve M. Mcbride 58 F San Wadhor 2021-07-02 2021-07-02 NA NA 2021-06-30 2021-07-03 08:00:00
8 18400 Leonidas Hudson 14 M Eastmsallbuck Creek 2021-07-07 2021-07-08 2021-07-15 08:00:00 2021-07-15 08:00:00 2021-07-06 2021-07-09 08:00:00
9 18793 Dustin Payne 10 M Grand Wellworth 2021-07-08 2021-07-09 NA NA 2021-07-09 2021-07-10 08:00:00
10 44980 Amal Ford 40 M Grand Wellworth 2021-07-15 2021-07-15 NA NA 2021-07-10 2021-07-16 00:00:00
11 44986 Martin F Romero 18 M Port Sipleach 2021-07-15 2021-07-15 NA NA 2021-07-12 2021-07-16 00:00:00
12 44990 Fern Christian Mcarthur 40 M Port Sipleach 2021-07-15 2021-07-15 NA NA 2021-07-14 2021-07-16 00:00:00
13 44992 Jessica Bauer 3 F Eastmsallbuck Creek 2021-07-15 2021-07-15 NA NA 2021-07-10 2021-07-16 00:00:00
14 44997 Penelope Fields 16 F Mexe 2021-07-15 2021-07-16 NA NA NA 2021-07-16 00:00:00

7.2 Checking date consistencies

Finally, we check whether the date of symptom onset is on or before the date of specimen collection of the first positive test result. As mentioned earlier, when the date of symptom onset is later than that date of specimen collection, we replace the date of symptom onset with the date of specimen collection.

#### Correct for date inconsistencies
case_latest <- case_latest %>%
  mutate(DateOnset = 
           case_when(DateOnset <= DateSpecimenCollection_PositiveFirst ~ DateOnset,
                     DateOnset > DateSpecimenCollection_PositiveFirst ~ DateSpecimenCollection_PositiveFirst,
                     TRUE ~ as.POSIXct(NA)))

The corrected case linelist now looks like this:

CaseID Name Age Sex Municipality DateSpecimenCollection_PositiveFirst DateResultReleased_PositiveFirst DateSpecimenCollection_NegativeFirst DateResultReleased_NegativeFirst DateOnset DateReport
1 12579 Elijah Henson 38 M Mexe 2021-07-03 2021-07-04 2021-07-10 08:00:00 2021-07-11 08:00:00 2021-07-02 2021-07-05 08:00:00
2 13289 Ella-Mai Gregory 58 F Grand Wellworth 2021-07-02 2021-07-03 NA NA 2021-07-02 2021-07-04 08:00:00
3 13479 Ceara West 2 F Chorgains 2021-07-01 2021-07-02 2021-07-15 08:00:00 2021-07-16 08:00:00 2021-07-01 2021-07-03 08:00:00
4 13547 Francissek Vickers 2 M Eastmsallbuck Creek 2021-07-10 2021-07-11 2021-07-15 08:00:00 2021-07-16 08:00:00 2021-07-10 2021-07-12 08:00:00
5 13566 Penelope F. Fields 45 F San Wadhor 2021-07-02 2021-07-02 NA NA 2021-07-01 2021-07-03 08:00:00
6 13597 Agata Lucas 35 F Port Sipleach 2021-07-02 2021-07-03 NA NA 2021-06-30 2021-07-04 08:00:00
7 13788 Eve M. Mcbride 58 F San Wadhor 2021-07-02 2021-07-02 NA NA 2021-06-30 2021-07-03 08:00:00
8 18400 Leonidas Hudson 14 M Eastmsallbuck Creek 2021-07-07 2021-07-08 2021-07-15 08:00:00 2021-07-15 08:00:00 2021-07-06 2021-07-09 08:00:00
9 18793 Dustin Payne 10 M Grand Wellworth 2021-07-08 2021-07-09 NA NA 2021-07-08 2021-07-10 08:00:00
10 44980 Amal Ford 40 M Grand Wellworth 2021-07-15 2021-07-15 NA NA 2021-07-10 2021-07-16 00:00:00
11 44986 Martin F Romero 18 M Port Sipleach 2021-07-15 2021-07-15 NA NA 2021-07-12 2021-07-16 00:00:00
12 44990 Fern Christian Mcarthur 40 M Port Sipleach 2021-07-15 2021-07-15 NA NA 2021-07-14 2021-07-16 00:00:00
13 44992 Jessica Bauer 3 F Eastmsallbuck Creek 2021-07-15 2021-07-15 NA NA 2021-07-10 2021-07-16 00:00:00
14 44997 Penelope Fields 16 F Mexe 2021-07-15 2021-07-16 NA NA NA 2021-07-16 00:00:00
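If you want to see which records are affected before overwriting them, a quick audit (a suggested addition, not in the original script) run just before the correction step lists the inconsistent rows:

```r
# Run this BEFORE the correction step: list cases whose reported onset
# date is later than the first positive specimen collection date
case_latest %>%
  filter(DateOnset > DateSpecimenCollection_PositiveFirst) %>%
  select(CaseID, DateOnset, DateSpecimenCollection_PositiveFirst)
```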

7.3 Export to Microsoft Excel

If we want to export our tibble case_latest to a Microsoft Excel spreadsheet for archiving, we use writexl’s write_xlsx() function. We name the file after today’s case linelist, which in this case is fakecaselinelist_2021-07-16.xlsx. This file is what we will use as case_yday in tomorrow’s processing.
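Rather than typing the date into the file name by hand each day, the name can be derived from the system date; a sketch, assuming the same ../cases/ folder and that writexl is loaded:

```r
# Build today's file name from the system date instead of hardcoding it
filename <- paste0("../cases/fakecaselinelist_", Sys.Date(), ".xlsx")
write_xlsx(case_latest, filename)
```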

##############################################################################
##############################################################################

# Export to Excel
write_xlsx(case_latest, "../cases/fakecaselinelist_2021-07-16.xlsx")

##############################################################################
##############################################################################


8 Helpful Tips when Debugging Code

  1. R is case-sensitive, so always proofread when debugging code.
  2. Pay special attention to the location of the commas, the parentheses, and the brackets.
  3. A “hard reset” I always use is typing rm(list = ls()), which removes all objects from the environment, and then running the script from scratch. This is helpful when I cannot pinpoint exactly where the error is.

Congratulations on getting to the end! The entire process may be challenging, but the beauty of this approach is that once the script file has been set up, we only need to modify the code that requires manual processing (the sections marked with double rows of hash keys #), so the efficiency gains of a data processing workflow in R pay off quickly when we have to process surveillance data on a daily basis.


Comments welcome on Twitter: @jasonhaw_